Conversation
|
Nope. Tests still failing. |
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## unstable #3430 +/- ##
============================================
- Coverage 76.82% 76.55% -0.28%
============================================
Files 159 159
Lines 79704 79704
============================================
- Hits 61232 61015 -217
- Misses 18472 18689 +217 🚀 New features to boost your workflow:
|
The slave buffer test sends 1M pipelined commands without reading replies, causing the server's client output buffer to grow. On slow CI runners this can lead to TCP backpressure and I/O errors. Use CLIENT REPLY OFF so the server doesn't generate replies, avoiding the output buffer buildup entirely. This matches the pattern used in commit 87d2330 for similar tests in memefficiency.tcl. Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
…ldup The test_memory_efficiency proc sends 10K pipelined SET commands with values up to 16KB without reading replies. This causes the server's client output buffer to grow, leading to TCP backpressure and connection timeouts on slow CI runners. Use CLIENT REPLY OFF to suppress reply generation, matching the pattern used in the slave buffer test fix. Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
The wait_for_condition checking key consistency in test_migrated_replica and test_sub_replica can throw an unhandled MOVED exception when cluster nodes haven't fully updated their slot routing after failover. This crashes the entire test run, taking down all parallel tests. Wrap the condition in catch so MOVED errors are treated as not ready yet and retried. Also wrap the debug prints in the else clause in catch since they can throw the same error. This turns a test runner crash into a proper test failure if the condition is never met. Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Same fix as for test_sub_replica: wrap the key consistency check in catch so MOVED errors during cluster convergence are retried instead of crashing the test runner. Also wrap debug prints in catch. Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
The test waits for 'Failover attempt expired' with 1000*10ms = 10s, but the default cluster-node-timeout in start_cluster is 3000ms, so auth_timeout is 6s plus ~3s for failure detection, totaling ~9s. This is barely enough and fails on slow CI runners. Increase to 1200*50ms = 60s to provide sufficient margin. Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
d288224 to
587566f
Compare
| if {$k % 10000 == 0} {$rd_master flush} | ||
| } | ||
| for {set k 0} {$k < $cmd_count} {incr k} { | ||
| $rd_master read | ||
| } | ||
| $rd_master client reply on | ||
| $rd_master flush | ||
| $rd_master read ;# read the +OK from CLIENT REPLY ON |
There was a problem hiding this comment.
Is flush correct here, shouldn't we do client_reply_off_wait_for_server $rd_master like in 87d2330?
There was a problem hiding this comment.
client_reply_off_wait_for_server waits for the server to consume and execute the commands.
I'm not sure it's necessary here. If the replies build up in the $rd_master's incoming TCP buffer, $rd_master will not be able to handle TCP-layer metadata traffic, but now I guess the kernel can handle those TCP sequence numbers and ACK messages better. I'm not 100% sure though.
The test looks up the replica's main-channel connection id after writing 50MB of data. On slow CI runners, the replica connection may have been disconnected by the output buffer soft limit (64MB/60s) before the lookup, causing get_client_id_by_last_cmd to return empty. Two changes: - Move the connection id lookup before the write loop, while the sync is known to be in progress. - Reduce writes from 50 x 1MB to 10 x 1MB. The test only needs enough data to exceed the lazyfree threshold (64 blocks ~= 1MB). 10MB is sufficient and avoids approaching the output buffer limit. Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Fixing multiple flaky tests.
slave buffer are counted correctly in tests/unit/maxmemory.tcl
Memory efficiency with values in range * in tests/unit/memefficiency.tcl
These tests send large numbers of pipelined commands using deferring
clients without reading replies, causing the server's client output
buffer to grow. On slow CI runners, this leads to TCP backpressure and
I/O errors that crash the test runner. Fix: Use CLIENT REPLY OFF to
suppress reply generation, matching the pattern from commit 87d2330.
---
Sub-replica reports zero repl offset and rank, and fails to win election
in tests/unit/cluster/replica-migration.tcl
New non-empty replica reports zero repl offset and rank, and fails to
win election in tests/unit/cluster/replica-migration.tcl
In the replica-migration tests, a MOVED errors results in an Tcl
exception. After failover, wait_for_condition blocks issue GET commands
to cluster nodes that may not have fully updated their slot routing. An
unhandled MOVED exception crashes the test runner. Fix: Wrap the
condition in catch so MOVED errors are retried. Also wrap debug prints
in the else clause. Fixes the following tests:
---
Replica can update the config epoch when trigger the failover -
automatic in tests/unit/cluster/failover2.tcl
Increase wait timeout for failover expiry. The test waits 10 seconds for
"Failover attempt expired", but the default cluster-node-timeout in
start_cluster is 3000ms, making auth_timeout 6 seconds plus ~3
seconds for failure detection — barely fitting in 10 seconds and failing
on slow CI runners. Fix: Increase wait from 1000×10ms to 1200×50ms
(60 seconds).
---
dual-channel-replication lazyfree test
The test looks up the replica's main-channel connection id after writing
50MB of data. On slow CI runners, the replica connection may have been
disconnected by the output buffer soft limit (64MB/60s) before the
lookup, causing get_client_id_by_last_cmd to return empty. Two changes:
1. Move the connection id lookup before the write loop, while the sync
is known to be in progress.
2. Reduce writes from 50 x 1MB to 10 x 1MB. The test only needs enough
data to exceed the lazyfree threshold (64 blocks ~= 1MB). 10MB is
sufficient and avoids approaching the output buffer limit.
---------
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Some test cases write thoughsands of commands in a pipeline and afterwards read the replies. This can lead to TCP ACK being dropped and the connection broken. CLIENT REPLY OFF prevents this. "Main db not affected when fail to diskless load" in cluster/diskless-load-swapdb has been observed to be flaky. The others are just defensive but they follow the same pattern. Similar fixes in the past: #3430, #2483. Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
|
I think we need to backport this PR to stabilize weekly tests. Adding it to 9.0, 8.1, 8.0 and 7.2. It should help with the weekly tests that have been failing. |
Cherry-pick of f3b6470 from unstable. Skipped dual-channel-replication.tcl (lazyfree test doesn't exist on 8.0). Resolved memefficiency.tcl conflict: kept $rd close after new CLIENT REPLY pattern. Fixes test_slave_buffers and test_memory_efficiency flaky failures by using CLIENT REPLY OFF to prevent TCP backpressure. Signed-off-by: Sarthak Aggarwal <sarthakaggarwal97@gmail.com>
Partial cherry-pick of f3b6470 from unstable. Only maxmemory.tcl and memefficiency.tcl apply to 7.2. Skipped: dual-channel-replication.tcl, failover2.tcl, replica-migration.tcl (tests/features don't exist on 7.2). Adapted: valkey_deferring_client -> redis_deferring_client. Kept $rd close in memefficiency.tcl (present on 7.2). Fixes test_slave_buffers and test_memory_efficiency flaky failures by using CLIENT REPLY OFF to prevent TCP backpressure. Signed-off-by: Sarthak Aggarwal <sarthakaggarwal97@gmail.com>
Cherry-pick of f3b6470 from unstable. Skipped dual-channel-replication.tcl (lazyfree test doesn't exist on 8.1). Fixes test_slave_buffers and test_memory_efficiency flaky failures by using CLIENT REPLY OFF to prevent TCP backpressure. Signed-off-by: Sarthak Aggarwal <sarthakaggarwal97@gmail.com>
Cherry-pick of f3b6470 from unstable. Skipped dual-channel-replication.tcl (lazyfree test doesn't exist on 9.0). Fixes test_slave_buffers and test_memory_efficiency flaky failures by using CLIENT REPLY OFF to prevent TCP backpressure. Signed-off-by: Sarthak Aggarwal <sarthakaggarwal97@gmail.com>
Cherry-pick of f3b6470 from unstable. Skipped dual-channel-replication.tcl (lazyfree test doesn't exist on 9.0). Fixes test_slave_buffers and test_memory_efficiency flaky failures by using CLIENT REPLY OFF to prevent TCP backpressure. Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
Cherry-pick of f3b6470 from unstable. Skipped dual-channel-replication.tcl (lazyfree test doesn't exist on 8.1). Fixes test_slave_buffers and test_memory_efficiency flaky failures by using CLIENT REPLY OFF to prevent TCP backpressure. Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
Cherry-pick of f3b6470 from unstable. Skipped dual-channel-replication.tcl (lazyfree test doesn't exist on 8.0). Resolved memefficiency.tcl conflict: kept $rd close after new CLIENT REPLY pattern. Fixes test_slave_buffers and test_memory_efficiency flaky failures by using CLIENT REPLY OFF to prevent TCP backpressure. Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
Partial cherry-pick of f3b6470 from unstable. Only maxmemory.tcl and memefficiency.tcl apply to 7.2. Skipped: dual-channel-replication.tcl, failover2.tcl, replica-migration.tcl (tests/features don't exist on 7.2). Adapted: valkey_deferring_client -> redis_deferring_client. Kept $rd close in memefficiency.tcl (present on 7.2). Fixes test_slave_buffers and test_memory_efficiency flaky failures by using CLIENT REPLY OFF to prevent TCP backpressure. Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
Fixing multiple flaky tests.
slave buffer are counted correctly in tests/unit/maxmemory.tcl
Memory efficiency with values in range * in tests/unit/memefficiency.tcl
These tests send large numbers of pipelined commands using deferring
clients without reading replies, causing the server's client output
buffer to grow. On slow CI runners, this leads to TCP backpressure and
I/O errors that crash the test runner. Fix: Use CLIENT REPLY OFF to
suppress reply generation, matching the pattern from commit 87d2330.
---
Sub-replica reports zero repl offset and rank, and fails to win election
in tests/unit/cluster/replica-migration.tcl
New non-empty replica reports zero repl offset and rank, and fails to
win election in tests/unit/cluster/replica-migration.tcl
In the replica-migration tests, a MOVED errors results in an Tcl
exception. After failover, wait_for_condition blocks issue GET commands
to cluster nodes that may not have fully updated their slot routing. An
unhandled MOVED exception crashes the test runner. Fix: Wrap the
condition in catch so MOVED errors are retried. Also wrap debug prints
in the else clause. Fixes the following tests:
---
Replica can update the config epoch when trigger the failover -
automatic in tests/unit/cluster/failover2.tcl
Increase wait timeout for failover expiry. The test waits 10 seconds for
"Failover attempt expired", but the default cluster-node-timeout in
start_cluster is 3000ms, making auth_timeout 6 seconds plus ~3
seconds for failure detection — barely fitting in 10 seconds and failing
on slow CI runners. Fix: Increase wait from 1000×10ms to 1200×50ms
(60 seconds).
---
dual-channel-replication lazyfree test
The test looks up the replica's main-channel connection id after writing
50MB of data. On slow CI runners, the replica connection may have been
disconnected by the output buffer soft limit (64MB/60s) before the
lookup, causing get_client_id_by_last_cmd to return empty. Two changes:
1. Move the connection id lookup before the write loop, while the sync
is known to be in progress.
2. Reduce writes from 50 x 1MB to 10 x 1MB. The test only needs enough
data to exceed the lazyfree threshold (64 blocks ~= 1MB). 10MB is
sufficient and avoids approaching the output buffer limit.
---------
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Some test cases write thoughsands of commands in a pipeline and afterwards read the replies. This can lead to TCP ACK being dropped and the connection broken. CLIENT REPLY OFF prevents this. "Main db not affected when fail to diskless load" in cluster/diskless-load-swapdb has been observed to be flaky. The others are just defensive but they follow the same pattern. Similar fixes in the past: valkey-io#3430, valkey-io#2483. Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Cherry-pick of f3b6470 from unstable. Skipped dual-channel-replication.tcl (lazyfree test doesn't exist on 8.0). Resolved memefficiency.tcl: kept $rd close after CLIENT REPLY pattern. Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
Partial cherry-pick of f3b6470 from unstable. Only maxmemory.tcl and memefficiency.tcl apply to 7.2. Adapted: valkey_deferring_client -> redis_deferring_client. Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
|
@zuiderkwast Hey Viktor, another failing test from #3523 for this flaky one: Then I re-ran it and it passed, but it did fail sometimes though. |
Use CLIENT REPLY OFF pattern in the defrag eval scripts test which pipelines 50,000 script loads + 50,000 set commands. Without this, the TCP send buffer fills up causing 'I/O error reading reply'. Same CLIENT REPLY OFF pattern as valkey-io#3430 and valkey-io#3452. Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
Use CLIENT REPLY OFF pattern in the defrag eval scripts test which pipelines 50,000 script loads + 50,000 set commands. Without this, the TCP send buffer fills up causing 'I/O error reading reply'. Same CLIENT REPLY OFF pattern as valkey-io#3430 and valkey-io#3452. Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com> (cherry picked from commit 6047b58)
Fixing multiple flaky tests.
These tests send large numbers of pipelined commands using deferring clients without reading replies, causing the server's client output buffer to grow. On slow CI runners, this leads to TCP backpressure and I/O errors that crash the test runner. Fix: Use CLIENT REPLY OFF to suppress reply generation, matching the pattern from commit 87d2330.
In the replica-migration tests, a MOVED errors results in an Tcl exception. After failover, wait_for_condition blocks issue GET commands to cluster nodes that may not have fully updated their slot routing. An unhandled MOVED exception crashes the test runner. Fix: Wrap the condition in catch so MOVED errors are retried. Also wrap debug prints in the else clause. Fixes the following tests:
Increase wait timeout for failover expiry. The test waits 10 seconds for "Failover attempt expired", but the default cluster-node-timeout in start_cluster is 3000ms, making
auth_timeout 6 seconds plus ~3 seconds for failure detection — barely fitting in 10 seconds and failing on slow CI runners. Fix: Increase wait from 1000×10ms to 1200×50ms (60 seconds).
The test looks up the replica's main-channel connection id after writing 50MB of data. On slow CI runners, the replica connection may have been disconnected by the output buffer soft limit (64MB/60s) before the lookup, causing get_client_id_by_last_cmd to return empty. Two changes: